feat: update e2e RAG eval #292
Conversation
Summary of Changes

Hello @e06084, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a new, more realistic end-to-end RAG evaluation pipeline focused on the FiQA dataset. By loading data directly from HuggingFace and running multi-dimensional evaluation with the Dingo framework, it provides a standardized benchmarking approach that replaces the previous evaluation based on local mock data, significantly improving the accuracy and reproducibility of RAG system evaluation.
Code Review
This pull request updates the end-to-end RAG evaluation examples. It removes an old example script and adds a more comprehensive one (e2e_RAG_eval_with_mockRAG_fiqa.py) that performs a full RAG evaluation on the FiQA dataset. The documentation is also updated to reflect these changes. The new example script is well-structured and provides a solid end-to-end workflow. My review focuses on improving its maintainability and robustness. I've suggested making default parameters more consistent, handling file paths more robustly, cleaning up return values, and reducing code duplication. These changes should make the example script easier to understand, use, and maintain.
| """简单的 RAG 系统""" | ||
|
|
||
| def __init__(self, llm_client: AsyncOpenAI, retriever: FiQACorpusRetriever, | ||
| system_prompt: str = None, model: str = "gpt-3.5-turbo"): |
The default value for the model parameter is "gpt-3.5-turbo", which is inconsistent with the rest of the script that defaults to using "deepseek-chat" from the OPENAI_MODEL environment variable. This can be misleading for anyone using this class. To improve consistency and maintainability, I suggest making the default value consistent with the global configuration.
```diff
-                 system_prompt: str = None, model: str = "gpt-3.5-turbo"):
+                 system_prompt: str = None, model: str = "deepseek-chat"):
```
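Beyond hardcoding "deepseek-chat", the default could be derived from the same `OPENAI_MODEL` environment variable the rest of the script reads. A minimal sketch of that idea (the `SimpleRAG` class here is an illustrative stand-in, not the PR's actual class):

```python
import os

# Hypothetical: resolve the default model from the environment variable the
# rest of the script already uses, falling back to "deepseek-chat".
DEFAULT_MODEL = os.getenv("OPENAI_MODEL", "deepseek-chat")


class SimpleRAG:  # stand-in for the class under review
    def __init__(self, model: str = DEFAULT_MODEL):
        self.model = model
```

This keeps the class default and the global configuration in sync automatically, so a later change to the environment variable does not silently diverge from the constructor signature.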
```python
        return {
            "answer": answer,
            "retrieved_documents": docs,
            "context_list": [doc.page_content for doc in docs]
        }
```
The query method returns a retrieved_documents field which is a list of langchain_core.documents.Document objects. These objects are not JSON serializable, and this field is not used by the calling function generate_rag_responses (which uses context_list instead). To improve clarity and avoid returning unnecessary, non-serializable data, I suggest removing retrieved_documents from the return dictionary. A similar change should be applied to the return statement in the if not docs: block on lines 167-171.
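The serializability point can be demonstrated with a stand-in class (the `Document` below is a hypothetical mimic of `langchain_core.documents.Document`, used so the sketch runs without langchain installed):

```python
import json


class Document:  # stand-in for langchain_core.documents.Document
    def __init__(self, page_content):
        self.page_content = page_content


docs = [Document("FiQA passage 1"), Document("FiQA passage 2")]

# A list of arbitrary objects cannot be dumped to JSON...
try:
    json.dumps({"retrieved_documents": docs})
except TypeError as e:
    print("not serializable:", e)

# ...while a list of plain strings serializes cleanly.
print(json.dumps({"context_list": [d.page_content for d in docs]}))
```

Since `generate_rag_responses` only consumes `context_list`, dropping `retrieved_documents` loses nothing and keeps the return value directly dumpable to JSONL.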
```python
        return {
            "answer": answer,
            "context_list": [doc.page_content for doc in docs]
        }
```

```python
def print_metrics_summary(summary: SummaryModel):
    """打印指标统计摘要"""
    if not summary.metrics_score_stats:
        print("⚠️ 没有指标统计数据")
        return

    print("\n" + "=" * 80)
    print("📊 RAG 评估指标统计")
    print("=" * 80)

    for field_key, metrics in summary.metrics_score_stats.items():
        print(f"\n📁 字段组: {field_key}")
        print("-" * 80)

        for metric_name, stats in metrics.items():
            display_name = metric_name.replace("LLMRAG", "")
            print(f"\n {display_name}:")
            print(f" 平均分: {stats.get('score_average', 0):.2f}")
            print(f" 最小分: {stats.get('score_min', 0):.2f}")
            print(f" 最大分: {stats.get('score_max', 0):.2f}")
            print(f" 样本数: {stats.get('score_count', 0)}")
            if 'score_std_dev' in stats:
                print(f" 标准差: {stats.get('score_std_dev', 0):.2f}")

        overall_avg = summary.get_metrics_score_overall_average(field_key)
        print(f"\n 🎯 该字段组总平均分: {overall_avg:.2f}")

        metrics_summary = summary.get_metrics_score_summary(field_key)
        sorted_metrics = sorted(metrics_summary.items(), key=lambda x: x[1], reverse=True)

        print("\n 📈 指标排名(从高到低):")
        for i, (metric_name, avg_score) in enumerate(sorted_metrics, 1):
            display_name = metric_name.replace("LLMRAG", "")
            print(f" {i}. {display_name}: {avg_score:.2f}")

    print("\n" + "=" * 80)
```
The function print_metrics_summary seems to be a utility function that is also present in other example scripts like examples/rag/dataset_rag_eval_baseline.py. To adhere to the DRY (Don't Repeat Yourself) principle and improve maintainability, consider moving this function to a shared utility module (e.g., examples/rag/utils.py) and importing it where needed.
```python
    elif args.limit:
        output_filename = f"fiqa_end_to_end_rag_output_limit_{args.limit}.jsonl"

    output_path = "test/data/" + output_filename
```
Using + for path concatenation is not robust and can lead to issues on different operating systems. It's better to use os.path.join() to construct file paths. Additionally, the output directory "test/data/" is hardcoded, which makes the script less flexible. Consider making this a configurable parameter, for example, via a command-line argument.
```python
output_path = os.path.join("test/data", output_filename)
```
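Both points from this comment can be sketched together. Below, `--output-dir` is an assumed flag name (not part of the PR) showing how the hardcoded directory could become configurable while `os.path.join` handles the separator:

```python
import argparse
import os

# Hypothetical: expose the output directory as a CLI flag with the current
# hardcoded value as its default, then join paths portably.
parser = argparse.ArgumentParser()
parser.add_argument("--output-dir", default="test/data")
args = parser.parse_args(["--output-dir", "results"])  # simulate CLI input

output_filename = "fiqa_end_to_end_rag_output.jsonl"
output_path = os.path.join(args.output_dir, output_filename)
```

With no flag passed, behavior is unchanged (`test/data/...`); with `--output-dir results`, the script writes elsewhere without editing code.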